ABSTRACT 

Generalizability theory provides a technique for 
accurately estimating the reliability of measurements. The power of 
this theory is based on the simultaneous analysis of multiple sources 
of error variances. Equally important, generalizability theory 
considers relationships among the sources of measurement error. Just 
as multivariate inferential statistics consider relationships among 
variables that univariate statistics cannot detect, generalizability 
theory considers relationships of error measurement that classical 
theory cannot. An extensive discussion of the concept of reliability 
and its use in classical test theory and generalizability theory is 
presented. A comparison of classical test theory and generalizability 
theory illustrates how generalizability theory subsumes all other 
reliability estimates as special cases. A hypothetical data set 
provides examples of when the failure to use generalizability theory 
can lead to seriously erroneous estimates of test reliability. The 
framework of generalizability theory incorporates two stages of 
analysis: (1) a generalizability study; and (2) a decision study. The 
former analyzes the extent to which results are generalizable to a 
population, while the latter uses information from the 
generalizability study to determine other generalizability 
coefficients for variations of the measurement protocol. Six data 
tables are provided, and an appendix presents the GENOVA program code 
used. (TJH) 



ABSTRACT 

Generalizability theory provides a technique for most 
accurately estimating the reliability of measurements. The 
power of generalizability theory is based on the simultaneous 
analysis of multiple sources of error variances. A comparison 
of classical test theory and generalizability theory illustrates 
how generalizability theory subsumes all other reliability 
estimates as special cases. Further, a hypothetical data set 
provides examples of when the failure to use generalizability 
theory can lead to seriously erroneous estimates of test 
reliability. 



WHY GENERALIZABILITY THEORY YIELDS BETTER RESULTS 
THAN CLASSICAL TEST THEORY 
Behavioral measurements that yield reliable results are of 
paramount importance for social scientists. Ghiselli (1964) 
suggests that quantitative descriptions which compare traits 
among and within individuals must give a precise 
characterization of an individual in order to be very useful. 
Nunnally (1982, p. 1589) notes that 

Science is concerned with repeatable experiments. If 
data obtained from experiments are influenced by random 
errors of measurement, the results are not exactly 
repeatable. Thus, science is limited by the 
reliability of measuring instruments and by the 
reliability with which scientists use them. 
Historically, reliability of measurements has been determined by 
theory first articulated decades ago. This body of thought has 
come to be called classical test theory. Reliable information 
about individual differences is obtained by measurements that 
have minimum amounts of error variance and maximum amounts of 
systematic variance. Within classical test theory various 
coefficients are available for investigating single sources of 
error variance. 

However, in classical theory consideration of multiple 
sources of error variance within one analysis is unavailable. 
The inability to analyze more than one source of error variance 
at a time severely limits classical test theory as a 
psychometric technique. With the conceptualization and 
development of generalizability theory (Cronbach, Gleser, Nanda, 
& Rajaratnam, 1972), this limitation of classical test theory, 
i.e., the inability to examine multiple sources of error 
variance simultaneously, was resolved. Further, 
generalizability theory provides a technique for more accurately 
estimating the reliability of measurements. 

Reliability within classical test theory refers to how 
consistently test scores are measured under various 
circumstances (Gronlund, 1985; Nunnally, 1972). The importance 
of reliable measurement lies in how much confidence can be 
placed in the results. A measurement that consistently yields 
similar results over several administrations is dependable. 
Decisions based on the results can be made with confidence. 
However, unreliable results indicate the presence of error in a 
measurement. Data obtained from such a measurement are not 
dependable and consequently are of little value. Yet, the quest 
for the perfect test is a futile one. Measurement theorists 
suggest that no perfectly reliable measurement exists (Ebel & 
Frisbie, 1986; Nunnally, 1972). Inconsistency or measurement 
error is present in all instruments. 

There is some confusion concerning the referent for 
reliability coefficients. Frequently, an instrument or test is 
referred to as having reliability. Such references are, 
strictly speaking, incorrect. Data, not a test, have the 
characteristics of reliability. To illustrate, if the 
Scholastic Achievement Test (SAT) were administered in a city 
that had suffered a devastating tornado the day before, test 
scores would likely not reflect the true abilities of the 
examinees. Concentration on the SAT would have been hindered by 
the emotional upheaval and fatigue caused by the disaster. Thus, 
the data from the SAT would be unreliable, not the SAT. 
Similarly, a high school history exam given immediately before 
an important pep rally for a football district championship 
might appear unreliable. In an otherwise dependable test, low 
reliability must be attributable to the test results rather than 
the actual test itself. In addition to these factors, other 
common factors that contribute to inconsistency of test scores 
are anxiety and guessing. 

Understanding the functional characteristics of reliability 
leads researchers to recognize that measurement error can also 
invalidate significance testing. Nunnally (1972) states that 
attenuation occurs because measurement error tends to reduce 
correlations, i.e., makes them closer to zero. Thus, 
measurement error obscures true effect sizes. For example, if a 
student with a high ability for geography were measured by an 
unreliable geography test, the observed score would not reflect 
the student's actual ability. Likewise, systematic effects in 
experimental investigations would be blurred by the 
presence of error variance. Researchers strive to eliminate 
error so that observed scores reflect the actual capabilities of 
students, not extraneous factors. Thus, the more information 
is gained about sources of measurement error, the greater are 
the chances of reducing measurement error and of detecting 
systematic influences via significance testing. 
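
The attenuation described above can be illustrated with the classical relation between a true correlation and the correlation observed between two fallible measures. The sketch below is only an illustration: the function name and the reliabilities of 0.70 and 0.80 are hypothetical and do not come from this paper.

```python
import math


def attenuated_correlation(true_r, rel_x, rel_y):
    """Observed correlation expected under the classical attenuation
    relation: r_observed = r_true * sqrt(rel_x * rel_y)."""
    for rel in (rel_x, rel_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return true_r * math.sqrt(rel_x * rel_y)


# Hypothetical values: a true correlation of 0.60 measured with two
# instruments whose score reliabilities are 0.70 and 0.80.
observed = attenuated_correlation(0.60, 0.70, 0.80)
print(f"observed correlation is about {observed:.2f}")
```

Under these assumed values a true correlation of 0.60 shrinks to roughly 0.45, which is the sense in which measurement error moves correlations closer to zero.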



Generalizability theory considers the multiple sources of 
error that may influence scores. Although introduced as 
generalizability theory by Cronbach, Gleser, Nanda, and 
Rajaratnam (1972), related developments upon which 
generalizability theory was based were reported earlier (Hoyt, 
1941; Lindquist, 1953; Medley & Mitzel, 1963). Generalizability 
theory provides the framework to simultaneously examine multiple 
sources of error variance. By so doing, measurement reliability 
can be maximized through better informed test 
revision. 

The purpose of the present paper is to provide an 
introduction to the powerful measurement theory called 
generalizability theory. A comparison of classical test theory 
and generalizability theory will illustrate how generalizability 
theory subsumes all other reliability estimates as special 
cases. In addition, the paper will demonstrate that failure to 
use generalizability theory can lead to seriously erroneous 
estimates of test reliability. 

Classical Test Theory 

In classical test theory observed score variance is 
partitioned into true score variance and error variance. If 
tests were perfectly reliable, true score variance would equal 
observed score variance. However, since several error factors 
exist, classical theory provides estimates for at least three 
types of reliability: internal consistency, stability, and 
equivalence. Each reliability estimate considers one source of 
error: error in items, test occasions, or test forms. The 
estimated reliability is represented by a coefficient which 
indicates the ratio of true score variance to total observed 
score variance. 
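
In symbols, the partition and the ratio just described can be written as follows. The notation is assumed here for illustration and is not the paper's own: observed score variance is written as sigma^2_X, true score variance as sigma^2_T, and error variance as sigma^2_E.

```latex
\sigma^2_X = \sigma^2_T + \sigma^2_E,
\qquad
r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}
        = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

With no error variance the coefficient equals 1; as error variance grows relative to true score variance, the coefficient approaches 0.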

Clarification of the types of reliability coefficients will 
be facilitated by the inclusion of a hypothetical measurement 
situation. A subsequent generalizability analysis will be 
performed on the same data to provide a comparison of the two 
theories. 

The hypothetical example depicts a researcher attempting to 
establish a measurement protocol that reliably assesses college 
student attitudes towards the teaching profession. The 
instrument is administered to college seniors majoring in 
education. The researcher, being rather ambitious, hopes for a 
lucrative future as a psychometrician and therefore exerts 
sufficient energy to design two parallel forms of the same test. 
In a pilot study, six students are each administered the two 
parallel forms, each of which contains a different set of 
five items, i.e., items are nested in each test. Each of the 
forms is administered on two occasions. A hypothetical data 
set describing the example is presented in Table 1. The small 
sample size of the data set was intentional so that readers who 
wish to pursue the paper's purpose may replicate these analyses. 



INSERT TABLE 1 ABOUT HERE. 

Classical reliability coefficients for the research 
situation are presented in Table 2. Each of the three types of 
reliability examines a separate source of variance. Internal 
consistency examines the homogeneity of performance of items 
within a test. In internal consistency analysis the items can 
be evaluated in a variety of ways depending on the coefficient 
selected, e.g., split-half, coefficient alpha, and 
Kuder-Richardson 20 (Ebel & Frisbie, 1986). A high reliability 
suggests that the items are homogeneous with respect to 
statistical characteristics of interest. Table 2 presents four 
internal consistency reliabilities, one for each of the 
parallel forms on each of the two occasions. The hypothetical 
young researcher in the scenario has found the varying 
reliabilities disturbing. Form A's reliability estimate is 0.48 
on occasion one but 0.91 on occasion two. Cognizant that 
expertly designed tests often yield reliability coefficients of 
0.90 or higher (Ebel & Frisbie, 1986; Nunnally, 1972), the 
researcher is perplexed about the true reliability of the data 
from Form A. In addition, the reliability coefficients of Form B 
are more stable but offer no consolation because the estimates 
of 0.72 and 0.74 indicate the presence of substantial 
measurement error. 
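
As a concrete illustration of one of the internal consistency coefficients named above, the following sketch computes coefficient alpha from a persons-by-items score matrix. It is a minimal example, not the procedure used to produce Table 2; the function name and the response data are hypothetical, since Table 1 is not reproduced here.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Coefficient alpha for a persons-by-items score matrix
    (one row per person, one column per item)."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two persons and two items")
    # Sample variance of each item (column) across persons.
    item_variances = [variance([row[j] for row in scores])
                      for j in range(n_items)]
    # Sample variance of each person's total score.
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses (NOT the paper's Table 1): six persons, five items.
form_a_occasion_1 = [
    [4, 3, 4, 2, 3],
    [2, 2, 3, 1, 2],
    [5, 4, 4, 3, 5],
    [3, 3, 2, 2, 3],
    [1, 2, 1, 1, 2],
    [4, 5, 4, 3, 4],
]
print(f"alpha = {coefficient_alpha(form_a_occasion_1):.2f}")
```

The coefficient attends only to inconsistency across items within a single form on a single occasion, which is why it has to be computed four times in the example.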



INSERT TABLE 2 ABOUT HERE. 

Undaunted by the confusing internal consistency 
reliabilities, this ambitious hypothetical researcher continues 
with other reliability analyses. A second reliability estimate 
was obtained by evaluating stability. Stability of test 
scores is estimated to determine how stable an instrument is 
over time. A high degree of reliability indicates that 
measurements given on two occasions are relatively the same. 
However, a low reliability coefficient indicates instability is 
present. The stability coefficients of 0.93 and 0.88 are 
presented in Table 2. Encouraged by the high reliability, the 
researcher infers that student attitudes have remained 
consistent over the two occasions. 

The researcher is now confidently ready to test for a third 
type of reliability, equivalence. Equivalence reliability 
indicates the degree to which parallel forms of a test measure 
the same domain of interest. By correlating the scores of the 
six students, each taking Form A and Form B, an equivalence 
reliability coefficient is obtained. High reliability would 
indicate that the rank ordering on the two forms remained 
relatively unchanged and that the parallel forms could be used 
interchangeably with confidence. Unfortunately for the aspiring 
psychometrician, Form A and Form B were estimated to have a 
relatively unstable equivalence reliability. The coefficients 
of 0.88 and 0.72, presented in Table 2, indicate that the two 
forms were measuring somewhat different attitudes in regard to 
the teaching profession. 
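
Both the stability and the equivalence coefficients described above are correlations between two sets of total scores for the same persons. The sketch below shows that computation under stated assumptions: the function name and all scores are hypothetical and are not the paper's Table 1 data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical total scores for six students (NOT the paper's Table 1).
form_a_occasion_1 = [16, 10, 21, 13, 7, 20]
form_a_occasion_2 = [17, 11, 20, 12, 8, 21]
form_b_occasion_1 = [15, 12, 19, 14, 9, 17]

# Stability: same form, two occasions.
stability = pearson_r(form_a_occasion_1, form_a_occasion_2)
# Equivalence: two forms, same occasion.
equivalence = pearson_r(form_a_occasion_1, form_b_occasion_1)
print(f"stability = {stability:.2f}, equivalence = {equivalence:.2f}")
```

Each call looks at one pairing only, so occasion error and form error are never examined in the same analysis.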

The researcher has now obtained three types of reliability 
coefficients: internal consistency, stability, and equivalence. 
However, the coefficients have confused rather than clarified 
the reliability of the attitude measure because different 
estimates yield contradictory results. The coefficients of 
internal consistency and equivalence perplex the researcher as 
to what to do to increase the reliability of the attitude 
measurement. Zealous to become famous, the researcher 
determines by reading the scientific journals that 
generalizability theory is more appropriate than classical 
theory and can better address the inconsistencies presented by 
the data. 

Generalizability Theory 

Generalizability theory (G theory) subsumes classical 
theory as a special case. G theory encompasses the concepts of 
classical theory as well as accommodating complex measurement 
designs. The power of G theory lies in the consideration of 
multiple sources of error variance simultaneously. Classical 
test theory is limited to analyses of single sources of error 
variance (Thompson, 1989a; Webb, Rowley, & Shavelson, 1988). 

The two theories estimate measurement characteristics using 
different frameworks. A reliability coefficient in classical 
theory concerns the dependability of an instrument or procedure 
that is to be used on different occasions or with different 
forms. If over several test administrations the results remain 
relatively the same, the instrument is said to yield reliable 
information. In contrast, G theory looks not at how reliable 
an instrument is over varying situations but rather at how 
generalizable the results are to a universe. A generalizability 
coefficient represents the ratio of universe score variance 
(systematic variance) to observed score variance. The 
fundamental differences between classical test theory and 
generalizability theory have been stated by Shavelson, Webb, and 
Rowley (1989, p. 922): 

The concept of reliability, so fundamental to classical 
theory, is replaced by the broader and more flexible 
notion of generalizability. Instead of asking how 
accurately observed scores reflect their corresponding 
true scores, generalizability theory asks how 
accurately observed scores permit us to generalize 
about persons' behavior in a defined universe of 
situations. 

The framework of generalizability theory incorporates two 
stages of analysis. The first stage analyzes the degree to which 
results are generalizable to a population and is termed a 
generalizability study (G study). The second stage, the decision 
study (D study), uses information from the G study to determine 
other generalizability coefficients for variations of the 
measurement protocol. In other words, a G study estimates 
magnitudes of error variance and a D study uses the information 
to determine the best measurement design to get the most 
reliable scores in the most efficient manner. 

The conceptual foundation of a G study is based on a 
universe of admissible observations. This universe is an 
infinite set of conditions, of which the sampled conditions are 
assumed to be representative. Within the universe of admissible 
observations are variables or areas of measurement called 
facets. Facets provide information about the multiple sources 
and amounts of error in a measurement. Facets can be of many 
types. Items, tests, occasions, raters, or observers are facets 
typically of interest to researchers. For example, in a G study 
designed to measure the oral English proficiency of foreign 
teaching assistants, the facets of raters and occasions formed 
the universe of admissible observations (Bolus, Hinofotis, & 
Bailey, 1982). Facets are samples from a universe of all 
possible items, tests, occasions, raters, or observers, i.e., 
from the universe of admissible observations. Further, each 
facet is composed of conditions which vary. Thus, a G study 
takes into consideration a representative sample from a 
population of factors or variables, i.e., facets, with each 
having a range of conditions. Shavelson, Webb, and Burstein 
(1986) present generalizability studies that illustrate these 
issues. 

G Study Analyses 
When the previous example of the tenure seeking 
researcher is brought into a generalizability context, the 
measurement design provides concrete examples of the terms 
germane to generalizability theory. The researcher, somewhat 
mystified and weary from estimating individual reliabilities in 
the classical approach, hopes to salvage the remains of previous 
efforts. To further pursue the noble goal of a highly 
generalizable attitude measure, the researcher determines that 
the universe of admissible observations of the G study will 
contain the facets of items, forms, and occasions. After careful 
thought, the researcher defines the facets. Items will reflect 
attitudes toward the teaching profession, with two forms of the 
test given on two occasions three weeks apart. 

The development of a comprehensive design was due to the 
researcher's extensive knowledge newly gleaned from the library. 
Coefficients of generalizability can only be estimated to the 
degree that the universe of admissible observations has been 
defined (Brennan, 1983; Shavelson, Webb, & Rowley, 1989). 
Desiring the highest generalizability, the researcher optimizes 
the research design by including all facets that could affect 
generalizability. For example, without testing for error from 
forms or from more than one occasion, information is 
unattainable as to the error that may originate from these 
sources (Thompson, 1989a). In summary, for a G study to provide 
the most accurate estimate of generalizability, all facets 
representing error variance within the measurement design must 
be included in the analysis. 

One additional generalizability term not previously 
introduced is object of measurement. Object of measurement 
usually refers to persons and in the above scenario specifically 
refers to senior education students. However, in a study on 
school-level variables, schools were the object of measurement 
(O'Brien & Jones, 1986). An object of measurement is the 
variance which the researcher considers legitimate, e.g., 
student ability variations on a posttest in an experiment, and 
about which the researcher wishes to generalize. Facets contain 
error variance. Objects of measurement contain systematic 
variance and are analogous to the classical true score variance. 
In generalizability theory the estimated variance component for 
persons is the universe score variance. The remaining variance 
components represent error variance. 

A G study employs the statistical procedure of analysis of 
variance (ANOVA) to estimate variance components. Variance 
components are central to the framework of generalizability 
theory. Brennan (1983) suggests the importance of variance 
components: "generalizability theory emphasizes the estimation, 
use, and interpretation of variance components associated with 
universes" (p. xiii). For several years variance components 
were employed in statistical analyses (Guilford, 1950). 
However, the use of mean squares from which variance components 
are determined changed as F statistics and F tests became more 
popular (Brennan, 1983). The overriding concern of researchers 
became statistical significance testing. The importance attached 
to such tests appears to be prevalent today. In a recent 
article, Thompson (1989b) suggests that too many researchers 
attend only to statistical significance, disregarding other 
important issues such as effect size and replicability. 
Consequently, researchers may be unfamiliar with the use of mean 
squares for estimating variance components. Nevertheless, the 
concept of estimated variance components, not statistical 
significance, is important in generalizability theory. 

The hypothetical researcher's measurement study 
incorporates a 6 x 2 x 2 x 5 design (persons by occasions by 
tests by items) with items nested within the tests. Nested 
items (I:T) means that each person responds to a different set 
of items for each test (the score of person P on item I nested 
in test T), in contrast to a crossed design where each person 
would respond to the same items on each test (the score of 
person P on item I in both test T1 and T2). 
Partitioning through a factorial ANOVA provides estimated 
variance components for the sources of variation in this 
example: Persons (P), Occasions (O), Tests (T), and Items 
nested in the Test (I:T), the two-way interactions PO, PT, PI:T, 
OT, and OI:T, and the three-way interactions POT and POI:T and 
error. Table 3 presents the ANOVA for Table 1 data. The GENOVA 
computer program was used to calculate the analysis (Brennan, 
1983). 



INSERT TABLE 3 ABOUT HERE. 

Using an ANOVA, specifically the mean squares, a G 
study determines the estimated variance components. A concern 
common to the various methods of estimating variance components 
is how to treat negative estimates. Negative variance estimates 
sometimes occur, although a negative variance is conceptually 
impossible. Several methods are available 
to calculate variance components and to resolve negative 
estimates (Shavelson, Webb, & Rowley, 1989). In some methods 
the negative estimates are set to zero. GENOVA uses the two 
methods of (a) algorithms and (b) expected mean square equations 
(EMS) to estimate variance components. These estimates are 
presented in Table 4. Thompson (1989a) provides a non-technical 
discussion with mathematical examples of variance components. 
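
To make the step from mean squares to variance components concrete, the sketch below works through a deliberately simpler case than the paper's design: a one-facet crossed persons-by-items (p x i) random design with one score per cell. It is not the GENOVA algorithm. The function name and the data are hypothetical, and negative estimates are simply set to zero, which is only one of the possible treatments.

```python
def variance_components_p_by_i(scores):
    """Estimate variance components for a crossed persons x items design
    from the expected mean square equations:
        MS_p    = var_pi,e + n_i * var_p
        MS_i    = var_pi,e + n_p * var_i
        MS_pi,e = var_pi,e
    Negative estimates are set to zero."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items, and the residual (pi,e).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    # Mean squares.
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    raw = {
        "p": (ms_p - ms_res) / n_i,
        "i": (ms_i - ms_res) / n_p,
        "pi,e": ms_res,
    }
    return {source: max(0.0, estimate) for source, estimate in raw.items()}


# Hypothetical scores (NOT the paper's Table 1): four persons, three items.
example = [
    [3, 4, 2],
    [5, 5, 4],
    [2, 3, 1],
    [4, 3, 3],
]
for source, estimate in variance_components_p_by_i(example).items():
    print(f"{source}: {estimate:.3f}")
```

The same logic extends to the nested design analyzed in the paper, with one expected mean square equation per source of variation in Table 3.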



INSERT TABLE 4 ABOUT HERE. 

The generalizability calculations for the data in Table 1 
are presented in Table 5. Since the objective of the 
researcher's measurement was to obtain scores reflecting 
individual differences in attitudes toward the teaching 
profession, a relatively large variance component (0.59) for the 
object of measurement, persons, was reassuring. The astute 
researcher knows that in an accurately measuring instrument most 
of the observed variance is systematic variance. Drawing more 
careful consideration from the researcher were the error 
components. Although seven of the variance components reflected 
little or no error, three components were troublesome. The 
two-way interaction between persons and items involved an error 
component of 0.65. This relatively large component suggested 
that persons were inconsistent in their attitudes across items. 
The variance component representing the three-way interaction 
of persons by occasions by items (0.33) was also troublesome. 
Interactions, especially three-way interactions, are difficult 
to explain. For this data set, the explanation may be 
that the responses of individual persons to individual attitude 
items lacked consistency across occasions. The final variance 
component which reflected error in measurement was the main 
effect variance component for test (0.17). The estimate 
indicated that Form A and Form B were correlated, but not so 
highly as might be hoped. With all of the information from the 
variance components, the researcher is anxious to obtain the 
long awaited coefficient of generalizability. Were fame and 
fortune just one coefficient away? 



INSERT TABLE 5 ABOUT HERE. 

However, before the researcher's curiosity could be 
satisfied, another theoretical issue of great importance became 
apparent. The researcher became aware of two types of G 
coefficients. One important feature of generalizability theory 
that classical test theory is unable to address is the 
distinction between relative and absolute decisions (Shavelson, 
Webb, & Rowley, 1989). Relative decisions are based solely on a 
person's rank order within a group, such as a score in the 90th 
percentile on a norm-referenced test. For instance, the 
California Achievement Test provides percentiles for the purpose 
of comparing the ability of one student to the ability of other 
students in several academic areas. A specific score is not 
used as a reference. The researcher cares only whether the 
relative position of the object of measurement is consistent 
across measurements, and does not care about the scores per se. 

Absolute decisions, on the other hand, involve concerns 
both about consistency of relative placement and about 
consistency of placement in relation to some absolute criterion 
such as a cutoff score or reference point. Several professions 
come to mind where competence must be demonstrated in relation 
to an absolute standard. Medical personnel, certified public 
accountants, and lawyers must achieve a passing score before 
being granted a license to practice. Similarly, an applicant 
for a driver's license must demonstrate a set level of 
competency on a driving test before legally getting behind the 
wheel of a car. For example, in a generalizability study on 
constructing diagnostic test profiles, the major purpose of 
testing was to assess individual status with respect to the 
knowledge domain of pronoun usage (Webb, Herman, & Cabello, 
1987). Mastery of the domain provided the basis for decisions 
from the diagnostic test. In short, the purpose of the 
measurement determines which type of coefficient is appropriate. 

Error variance for relative and absolute decisions is 
estimated using different combinations of variance components. 
A relative decision is determined only by the variance 
components that affect the relative standing of an individual in 
a group. For instance, in the hypothetical aspiring 
researcher's nested design, a relative decision was determined 
by the variance components which interact with the object of 
measurement, i.e., PO, PT, PI:T, POT, and POI:T with error. Main 
effect variance components are not reflective of the relative 
standing of an individual and are not included in the analysis. 
At last, with unbounded enthusiasm the researcher obtains from 
Table 5 the generalizability coefficient, 0.86. Although it is 
not as high as might be hoped, the researcher accepts the 
coefficient with a degree of relief. The powerful measurement 
technique of G theory has resolved the confusing, conflicting 
reliability coefficients of classical test theory by yielding 
one coefficient representing the generalizability of the 
attitude instrument. The generalizability coefficient of 0.86 
represents the degree to which scores are generalizable for a 
relative decision. 

However, if an absolute decision had been the researcher's 
focus, all facet variance components including main effects 
would have been used in the generalizability calculations: O, T, 
I:T, PO, PT, PI:T, OT, OI:T, POT, and POI:T and error. 
Needless to say, the researcher was relieved that absolute 
decisions would not be necessary, since the generalizability 
coefficient declined to 0.77, as represented by phi in Table 5. 
Further discussion and formulas relating to relative and 
absolute decisions are provided by Brennan (1983), Gillmore 
(1983), and Webb, Rowley, and Shavelson (1988). Classical test 
theory, unlike generalizability theory, cannot distinguish the 
differential reliability of scores employed for relative as 
against absolute decisions (Brennan, 1983, p. 18), again 
reflecting the limits of classical theory. 
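
The two coefficients can be written out for a random-effects persons by occasions by (items nested in tests) design. This is the standard form for such a design rather than a transcription of Table 5, and the symbols are assumptions made here for illustration: n_o, n_t, and n_i are the numbers of occasions, tests, and items per test, delta denotes relative error, and Delta denotes absolute error.

```latex
\sigma^2_\delta =
    \frac{\sigma^2_{po}}{n_o}
  + \frac{\sigma^2_{pt}}{n_t}
  + \frac{\sigma^2_{pi:t}}{n_i n_t}
  + \frac{\sigma^2_{pot}}{n_o n_t}
  + \frac{\sigma^2_{poi:t,e}}{n_o n_i n_t}

\sigma^2_\Delta = \sigma^2_\delta
  + \frac{\sigma^2_{o}}{n_o}
  + \frac{\sigma^2_{t}}{n_t}
  + \frac{\sigma^2_{i:t}}{n_i n_t}
  + \frac{\sigma^2_{ot}}{n_o n_t}
  + \frac{\sigma^2_{oi:t}}{n_o n_i n_t}

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta},
\qquad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}
```

Because the absolute error variance contains every term of the relative error variance plus the main effect and facet interaction terms, phi can never exceed the generalizability coefficient for the same design, which is consistent with the drop from 0.86 to 0.77 reported here.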

D Study Analyses 

The newly energized researcher forges ahead to the second 
stage of generalizability theory, the decision study (D study). 
D studies use variance components information from the G study 
to design a measurement protocol that both minimizes error 
variance and is most efficient, i.e., yields the most reliable 
scores with the least effort. Shavelson, Webb, and Rowley 
(1989, p. 925) state, "In distinguishing a G study from a D 
study, G theory recognizes that the former is associated with 
the development of a measurement procedure whereas the latter 
then applies the procedure." 

A concept central to a D study is the universe of 
generalization. The concept refers to the universe to which the 
researcher wishes to generalize. A D study can include all of 
the facets in the universe of admissible observations, can 
reduce the number of levels or conditions of a facet, or can 
even eliminate a facet. However, a D study cannot include facets 
that were not present in the universe of admissible observations 
during the G study. Conditions to be sampled can vary in a D 
study but must be present in the G study so that the necessary 
variance components are available to estimate the effects of 
various changes in the measurement protocol. 

Within the D study analysis, the researcher alters the 
measurement design of the G study by varying the conditions of 
the facets. The analysis is performed by dividing the variance 
components estimated in the G study by the number of levels of 
the corresponding facets in the D study design. For example, one 
D study analyzed a design with one occasion, one form, and five 
items and yielded an estimated generalizability coefficient of 
0.70. In another analysis with a similar design containing 10 
items, the coefficient increased to 0.79. By increasing the 
items to 25 in a one test, one occasion design, the coefficient 
increased to 0.85. The improvement in the coefficient by 
increasing the items is reasonable since in the present example 
a large error component was present for the interaction of 
persons by items nested in a test. The use of more items divides 
the variance from this measurement error source by a larger 
number, resulting in a larger estimated generalizability. 
Therefore, the D study provided two measurement designs by which 
the researcher can achieve a similar degree of 
generalizability: either two tests with five items each given on 
two occasions (0.86) or one test with 25 items given once 
(0.85). If the researcher is satisfied with this outcome, the 
protocol which is most efficient or practical can now be 
selected in an informed manner. 
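
The D study arithmetic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GENOVA computation: only the three variance components quoted in the text (0.59, 0.65, and 0.33) are real, the function and dictionary names are hypothetical, and the remaining person-related components are set to zero as placeholders because Table 4 is not reproduced here. The output will therefore not match the coefficients reported from Table 5.

```python
def relative_g_coefficient(vc, n_o, n_t, n_i):
    """Generalizability coefficient for relative decisions in a
    persons x occasions x (items nested in tests) random design.
    Each component is divided by the number of D study conditions
    over which scores are averaged."""
    relative_error = (
        vc["po"] / n_o
        + vc["pt"] / n_t
        + vc["pi:t"] / (n_i * n_t)
        + vc["pot"] / (n_o * n_t)
        + vc["poi:t,e"] / (n_o * n_i * n_t)
    )
    return vc["p"] / (vc["p"] + relative_error)


# Only "p", "pi:t", and "poi:t,e" are quoted in the text. The zeros are
# placeholders (assumptions) to be replaced with the Table 4 estimates.
components = {
    "p": 0.59,
    "pi:t": 0.65,
    "poi:t,e": 0.33,
    "po": 0.0,
    "pt": 0.0,
    "pot": 0.0,
}

# Candidate D study designs: (occasions, tests, items per test).
for n_o, n_t, n_i in [(1, 1, 5), (1, 1, 10), (1, 1, 25), (2, 2, 5)]:
    g = relative_g_coefficient(components, n_o, n_t, n_i)
    print(f"occasions={n_o} tests={n_t} items/test={n_i}: {g:.2f}")
```

Even with the placeholder values, the pattern described in the text is visible: adding items shrinks the contribution of the large person by item component, so the coefficient rises as the number of items per test grows.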

Failure to use G theory, however, could have led the 
researcher to seriously erroneous estimates of test 
reliability (Thompson, 1989a). If the researcher had 
administered only Form A of the test on the first occasion, the 
classical reliability of 0.48 would have suggested that the 
project be abandoned. In addition, if the researcher had 
measured only the stability of Form A, a 0.93 reliability 
estimate would have stimulated unwarranted confidence in the 
measure. Deflated estimates of reliability would also have been 
obtained if only Form B's internal consistency on occasion one 
(0.73) and occasion two (0.74) had been computed. Importantly, 
four of the eight coefficients in Table 2 are lower than the 
generalizability coefficient of 0.86. 
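
To make the single-source character of the classical estimates 
concrete, the following Python sketch computes two of them for 
Form A from the item scores in Table 1. It assumes the textbook 
formulas (coefficient alpha for internal consistency, a Pearson 
correlation of total scores for stability); the values in Table 2 
were obtained with GENOVA, so this is an illustration under those 
assumptions, not the original procedure. The function names are 
hypothetical. 

```python
from math import sqrt
from statistics import mean, variance

# Item scores (five items) of the six persons on Form A, from Table 1.
FORM_A_OCCASION_1 = [
    [4, 4, 3, 5, 2],
    [5, 2, 5, 4, 5],
    [4, 3, 4, 4, 4],
    [4, 1, 3, 2, 2],
    [2, 3, 4, 2, 4],
    [5, 4, 5, 3, 2],
]
FORM_A_OCCASION_2 = [
    [4, 4, 4, 5, 3],
    [5, 2, 5, 4, 4],
    [3, 4, 4, 4, 4],
    [2, 2, 2, 2, 1],
    [2, 3, 3, 2, 2],
    [5, 4, 5, 4, 4],
]


def coefficient_alpha(scores):
    """Internal consistency: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(scores[0])
    item_variances = [variance(item) for item in zip(*scores)]
    total_variance = variance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


alpha_a1 = coefficient_alpha(FORM_A_OCCASION_1)
totals_1 = [sum(person) for person in FORM_A_OCCASION_1]
totals_2 = [sum(person) for person in FORM_A_OCCASION_2]
stability_a = pearson_r(totals_1, totals_2)

print(f"Form A, Occasion 1 internal consistency: {alpha_a1:.2f}")
print(f"Form A, Occasions 1 & 2 stability:       {stability_a:.2f}")
```

Each function looks at only one source of error at a time (items 
in the first case, occasions in the second), which is why the two 
coefficients for the same form can differ so sharply. 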

One final data set will further clarify the discussion of 
potentially erroneous classical reliabilities. Table 6 presents 
a similar data set representing the same measurement design. 
However, subjects obtain identical scores on Form A and Form B 
on each occasion. A classical reliability of equivalence yields 
an incredible 1.0, that is, perfectly reliable forms! In 
contrast, a G study indicates that the measurement's 
generalizability is 0.82. Error present in the measurement 
design went undetected by the single-source reliability estimate 
of the classical approach. Although both the single analysis and 
the identical scores in the data set are unrealistic, they do 
demonstrate a point. Multiple sources of error variance are 
important. Generalizability theory provides the framework needed 
to determine the influence of measurement error. Only 
generalizability theory can simultaneously consider all the 
multiple sources of measurement error. 



INSERT TABLE 6 ABOUT HERE. 

Put differently, a researcher may calculate internal 
consistency, stability, and equivalence reliability coefficients 
that all equal 0.90 for a data set, and yet the generalizability 
coefficient for the same data might be 0.60, because only 
generalizability theory considers the interaction of measurement 
error sources. Only generalizability theory honors a complex 
reality in which measurement error sources may interact to 
compound each other! 

Conclusion 

Measurement theory has advanced beyond classical test 
theory. A more powerful analysis, generalizability theory, 
considers all sources of error variance simultaneously. Equally 
important, generalizability theory considers relationships among 
the sources of measurement error. Just as multivariate 
inferential statistics consider relationships among variables 
that univariate statistics cannot detect, generalizability 
theory considers relationships among sources of measurement 
error that classical theory cannot. Nunnally (1982) suggests 
that generalizability theory goes even beyond the evaluation of 
measurement error: 

There really is no sharp borderline dividing studies 
of reliability from studies of validity. Consequently, 
the concepts and mathematical models relating to 
generalizability theory can be extended to wider, more 
important issues in the behavioral sciences than just 
the investigation of measurement error. (p. 1600) 

Thus, there is every possibility that reflective researchers 
will increasingly turn to generalizability theory as the 
measurement model of choice. 
Table 1 
Data for Study Example 

           Occasion   Form        Items         Total 

PERSON 1      1        A      4  4  3  5  2      18 
                       B      2  3  2  2  3      12 
              2        A      4  4  4  5  3      20 
                       B      2  2  1  2  3      10 

PERSON 2      1        A      5  2  5  4  5      21 
                       B      4  3  4  4  2      17 
              2        A      5  2  5  4  4      20 
                       B      5  3  4  5  2      19 

PERSON 3      1        A      4  3  4  4  4      19 
                       B      5  3  3  2  3      16 
              2        A      3  4  4  4  4      19 
                       B      5  4  3  2  4      18 

PERSON 4      1        A      4  1  3  2  2      12 
                       B      3  1  2  1  1       8 
              2        A      2  2  2  2  1       9 
                       B      2  2  1  3  2      10 

PERSON 5      1        A      2  3  4  2  4      15 
                       B      3  1  3  3  2      12 
              2        A      2  3  3  2  2      12 
                       B      1  1  3  2  1       8 

PERSON 6      1        A      5  4  5  3  2      19 
                       B      3  2  4  5  5      19 
              2        A      5  4  5  4  4      22 
                       B      2  2  5  5  5      19 



Table 2 

Classical Test Theory Reliabilities 

Internal consistency reliabilities: 

    Form A, Occasion 1              0.48 
    Form B, Occasion 1              0.73 
    Form A, Occasion 2              0.91 
    Form B, Occasion 2              0.74 

Stability reliability: 

    Form A, Occasions 1 & 2         0.93 
    Form B, Occasions 1 & 2         0.88 

Equivalence reliability: 

    Forms A & B, Occasion 1         0.88 
    Forms A & B, Occasion 2         0.72 
Table 3 

Random Effects ANOVA from GENOVA 

NOTE: FOR GENERALIZABILITY ANALYSES, F-STATISTICS SHOULD BE 
IGNORED 



Table 4 

Variance Components Estimated from Random Effects ANOVA 
MODEL VARIANCE COMPONENTS 

    USING EMS       STANDARD 
    EQUATIONS       ERROR 

     0.5895000      0.3687613 
    -0.0100000      0.0080050 
     0.1683333      0.1647915 
     0.0008333      0.0686416 

     0.0000000      0.0478755 
    -0.0690000      0.0823673 
     0.6450000      0.1797435 
    -0.0200000      0.0125388 
     0.0000000      0.0269541 

     0.0616667      0.0691760 
     0.3250000      0.0709208 

NOTE: THE "ALGORITHM" AND "EMS" ESTIMATED VARIANCE COMPONENTS WILL BE 
IDENTICAL IF THERE ARE NO NEGATIVE ESTIMATES 
Table 5 

Generalizability Calculations from GENOVA 

VARIANCE COMPONENTS IN TERMS OF 
D STUDY UNIVERSE (OF GENERALIZATION) SIZES 

                                        STANDARD    STANDARD ERROR 
                            VARIANCE    DEVIATION   OF VARIANCE 

UNIVERSE SCORE              0.58950     0.76779     0.36876 
EXPECTED OBSERVED SCORE     0.68567     0.82805     0.36650 
LOWER CASE DELTA            0.09617     0.31011     0.04074 
UPPER CASE DELTA            0.18042     0.42475     0.08864 
MEAN                        0.19853     0.44556 

GENERALIZABILITY COEFFICIENT = 0.85975 ( 6.12998) 

PHI = 0.76567 ( 3.26744) 

NOTE: SIGNAL/NOISE RATIOS ARE IN PARENTHESES 
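
The coefficients in Table 5 follow directly from the listed 
variances. The short Python sketch below recomputes them from the 
rounded values printed in the table, so the last digits may differ 
slightly from the GENOVA output. The variable names are 
hypothetical, and reading "lower case delta" as relative error 
variance and "upper case delta" as absolute error variance is the 
usual G theory convention, stated here as an assumption. 

```python
# Variances as printed in Table 5 (rounded to five decimals).
universe_score = 0.58950
expected_observed_score = 0.68567
relative_error = 0.09617  # lower case delta (assumed: relative error variance)
absolute_error = 0.18042  # upper case delta (assumed: absolute error variance)

# Generalizability coefficient: universe score variance over expected
# observed score variance.
g_coefficient = universe_score / expected_observed_score

# Phi: universe score variance over universe score plus absolute error.
phi = universe_score / (universe_score + absolute_error)

# Signal/noise ratios: universe score variance over each error variance.
signal_noise_g = universe_score / relative_error
signal_noise_phi = universe_score / absolute_error

print(f"G coefficient = {g_coefficient:.3f} (signal/noise {signal_noise_g:.2f})")
print(f"Phi           = {phi:.3f} (signal/noise {signal_noise_phi:.2f})")
```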
Table 6 
Data for Example 

           Occasion   Form        Items         Total 
Appendix A 
GENOVA Program Code 

RECORDS ALL CORRELATION NEGATIVE 

* P 6 0 
+ I 5 0 

P "Person" 

I "Items within Test" 

(29X,5F5.0///) 

9 

9 DEFAULT 

#2 Form B Occasion #1 internal consistency 

RECORDS ALL CORRELATION NEGATIVE 

* P 6 0 
+ I 5 0 

P "Person" 

I "Items within Test" 

(/29X,5F5.0//) 

9 

9 DEFAULT 

#3 Form A Occasion #2 internal consistency 

RECORDS ALL CORRELATION NEGATIVE 

* P 6 0 
EFFECT 
EFFECT 

NAME 

NAME 

FORMAT 

REWIND 

PROCESS 

GSTUDY 

OPTIONS 
EFFECT 
+ I 5 0 
P "Person" 

I "Items within Test" 

(//29X,5F5.0/) 

9 

9 DEFAULT 

#4 Form B Occasion #2 internal consistency 

ALL CORRELATION NEGATIVE 

test-retest stability reliability 
CORRELATION NEGATIVE 
RECORDS 

* P 6 0 
+ I 5 0 

P "Person" 

I "Items within Test" 

(///29X,5F5.0) 

9 

9 DEFAULT 
RECORDS ALL CORRELATION NEGATIVE 

* P 6 0 
I "Items within Test" 
9 
#6 Form B 
P "Person" 
I "Items within Test" 
(/29X,5F5.0//29X,5F5 
9 

9 DEFAULT 

#7 Occasion #1 equivalence reliability 
RECORDS ALL CORRELATION NEGATIVE 

* P 6 0 
+ T 2 0 
+ I:T 5 0 
P "Person" 
(29X,5F5.0/29X,5F5.0//) 
9 

9 DEFAULT 

#8 Occasion #2 equivalence reliability 
RECORDS ALL CORRELATION NEGATIVE 

* P 6 0 
+ T 2 0 
+ I:T 5 0 
NAME P "Person" 

NAME T "Test" 

NAME I "Items within Test" 

FORMAT (//29X,5F5.0/29X,5F5.0) 

REWIND 9 

PROCESS 9 DEFAULT 
FINISH 



